Completes on another CPU the execution of a
program, or program task, terminated by a processor
error on a first CPU, without re-executing any
successfully completed instructions and without any
abnormal ending being provided to the program. The
continued program need not have any built-in recovery
or correction code. Predetermined register contents
in the failed processor are stored (92, 93) in
predetermined storage locations by the failing
processor (CPUf), or by a service processor (SP) if
the failing processor (CPUf) has not been able to
store them. The registers saved from the failed
processor are those defined by the system architecture
for saving on an interruption of a program to enable
the continuation of execution of the program after
restoring the contents of the PSWs, CRs, FPRs, GPRs,
ARs, etc., if using the ESA/370 architecture. When a
failed processor (CPUf) is detected, the SP issues
(33) an external interruption to other processors
(CPUh) in the system that are operable for continuing
the execution of the failed processor task after the
required information is stored. Special indicators
(MCIC) are stored in predetermined places (PSA) in
the system and/or microcode memory that is accessible
to the SP and to the healthy processors (CPUh) in the
system selectable for continuing the task's execution.



FIG. 9 (flow diagram; entered from FIG. 1 when a solid error
is detected)

91. CPUf sends checkstop signal to SP.
92. SP signals CPUf to store its register data (GPRs, FPRs,
    CRs, ARs, etc. in CPUf) and all CPUf outstanding stores.
    (Stored: go to 96. Not stored: go to 93.)
93. SP accesses CPUf and stores its MC interruption data in
    the logout area in the PSA of CPUf in MS.
94. SP completes any outstanding stores of CPUf.
    (No error: 95A. Error: 95B.)
95A. Set SLV bit = 1.
95B. Set SLV bit = 0.
96. SP sets validity bits, PD bit, CSLO bit, etc. in MCICf
    in CPUf PSA.
97. SP sets MC old PSW in CPUf PSA.
98. SP sets CPUf in checkstop state.
99. SP signals MFA to all CPUs except for CPUf (see FIG. 3).
101. Does a usable CPUh exist? (Yes: to FIG. 10. No: 102.)
102. ABEND CPUf terminated task.

Introduction

The invention allows another CPU in a multiprocessor
configured system (MP) to continue the execution of a
program without checkpointing, re-execution, or
repeating execution of any completed instructions in
the program when its executing CPU failed during
execution and before its completion.

Background 



In today's systems designed to operate with more than
one CPU, when a processor detects an error, it
attempts to correct the problem by a retry, such as
by retrying the instruction in which the error
occurred, or by re-executing the program in which the
error occurred. Checkpoint retry recovery is available
only if a program is designed to store checkpoint data
at various times during its execution. The retry
techniques are limited to intermittent types of
errors, and if a solid error occurs in the hardware,
it will persist through all retry attempts, so a
maximum number of retries is used and then a solid
(uncorrectable) error is declared if the error
remains. Detection of a solid error will cause the CPU
to generate a machine check (MC) interruption.

The MC interruption signals the system control program
and provides an MC new PSW (program status word) which
addresses an entry instruction in a recovery manager
program within the system control program. The system
control program then may attempt to re-execute the
interrupted instruction to see if the error condition
goes away. If the error condition does not go away,
the system control program declares an abnormal end
(ABEND) for the task that had its execution terminated
by the error condition in its processor. Depending on
the type of recovery support built into the terminated
program, it may or may not be able to recover. Often a
program lacks the ability to recover when it is
terminated at an unplanned point in its execution,
even when it has not lost its input data. And when
input data is lost due to an unplanned stoppage before
execution is complete, programs using real-time data
(such as from a teller machine or a process control
sensor) cannot recover their input data, and therefore
the attempted recovery fails even when an intermittent
hardware error is corrected.

The normal CPU operation of executing dispatched tasks
is ended by putting the CPU in a checkstopped state
(stopping the CPU internal cycle clocks) if the
re-execution of an instruction continues to fail,
which establishes that a solid hardware error exists.
The operating system software
may maintain a retry threshold after which the CPU
is checkstopped.

A checkstopped CPU is marked as a failed 
CPU by the system control program, so that it will 
not have any more program tasks dispatched on it. 

Summary of the Invention 



The subject invention is able to continue execution of
most program tasks interrupted by a processor failure
after all hardware retry attempts have failed. Thus,
the invention is usable to continue to successful
completion the hardware-terminated execution of an
operating system program or an application program
without any abnormal end (ABEND) being provided to the
program. The use of this invention is not dependent on
the terminated program having any built-in recovery or
correction code.

Use of the invention can avoid the re-execution of any
successfully completed instructions in, or of any
checkpoint retrying of, a program task that is
terminated by a processor error. That is, the
invention is capable of finishing the terminated task
on another CPU for most errors that cause a
termination of the task.

However, the invention prefers that a processor not be
removed from system operation unless the terminating
error is significant. In particular, intermittent
errors that may shortly go away are common in computer
systems and are often caused by alpha particles. The
invention recognizes that a hardware error condition
may go away in a short time, so it allows for a retry
of the instruction having an error, up to some
threshold number of times during which enough time has
expired to permit the error to go away, if it is an
intermittent type of error. Then, if the error goes
away, the processor is saved as a system resource and
it can continue to be used by the system.

Further, system efficiency is improved by the subject
invention even when the terminated task cannot be
continued after CPU failure, because the invention is
able to obtain information not previously available
for identifying system resources being used by the
terminated task. The invention provides this
information to the operating system so that it can
obtain a release of these system resources to allow
the released resources to be used by other tasks
(rather than remaining unusable by continuing to be
bound to an unrecoverable task). System efficiency is
dependent on the efficient use of its resources.

The subject invention requires unique modifications in
a service processor (SP) and in the operating system
(OS) software of a system for performing the novel
methods required by the subject invention.
Modification of the hardware or microcode of the CPUs
in a system is optional, depending on the architecture
of the CPUs.

Failure of another processor may be recognized by the
SP in any of a number of ways, such as by the failing
processor signalling its failure to the SP by sending
a special signal to one or more other processors, or
by the SP detecting that a processor does not respond
to a special request, or by the OS detecting that a
processor has not done anything for a period of time
with a task and performing the required operations.

In this invention the abnormal end (ABEND) process is
avoided for tasks terminated by processor failure
where in most cases ABENDs had to be used in the past.
Instead the job represented by the terminated task is
continued on another processor due to intervention by
the SP, which accesses predetermined registers in the
failed processor and stores their contents in
predetermined memory locations when the failing
processor has not been able to store this information.
These predetermined registers in the failed processor
are all registers required by the system architecture
to be saved in memory on an interruption of a program
for enabling the continuation of execution of the
program after its interruption (e.g. storing and
restoring the content of all of its PSWs, CRs, FPRs,
GPRs, ARs, etc.).

When the SP detects a failed processor, the SP issues
an external interruption to other processors in the
system that may be usable for continuing the execution
of the failed task. The external interruption signal
is sent after the SP or the failed processor has
stored the required interruption information and
special indicators into predetermined places in system
and/or microcode memory that is accessible to the SP
and to the healthy processors in the system selectable
for continuing the task's execution.

A healthy processor in the system is selected to
continue the execution of the task after its
termination on the failed processor. The selected
healthy processor can be any operable processor in the
system which is dedicated to, or shared by, the same
operating system that was controlling the failed
processor.

The processor selection process may involve the normal
interruption operation in the system, whereby the
first healthy processor to be enabled for external
interruption will recognize and handle the
interruption, and thereby continue the CPU-failed task
from its point of interruption until the task is
completed or next interrupted. With this invention,
tasks that would normally be lost by being abended
have instead been successfully completed.

The register contents of the failed processor are
verified when they are saved by the SP and their
validity indicated for determining if the terminated
task can be continued. The verification is done, for
example, by parity checking the content of each of the
failed processor's registers as it is read for being
saved, and setting a valid bit in a special memory
area for each type of saved register.
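
The per-type verification described above might be sketched as follows. This is only an illustrative Python sketch, not the patented implementation: it assumes odd parity over 8-bit register bytes, and the names `parity_ok`, `save_registers`, `with_parity` and the register layout are hypothetical rather than taken from the ESA/370 architecture.

```python
from typing import Dict, List, Tuple

# Hypothetical model: a register is a list of (data_byte, parity_bit) pairs.
Register = List[Tuple[int, int]]


def parity_ok(byte: int, parity_bit: int) -> bool:
    """Assumed odd parity: data bits plus parity bit hold an odd number of ones."""
    return (bin(byte & 0xFF).count("1") + parity_bit) % 2 == 1


def save_registers(regs_by_type: Dict[str, List[Register]]):
    """Copy register bytes to a save area and set one valid bit per register type."""
    saved: Dict[str, List[bytes]] = {}
    valid: Dict[str, bool] = {}
    for reg_type, regs in regs_by_type.items():
        saved[reg_type] = [bytes(b for b, _ in reg) for reg in regs]
        # A type is valid only if every byte of every register of that type checks.
        valid[reg_type] = all(parity_ok(b, p) for reg in regs for b, p in reg)
    return saved, valid


def with_parity(value: bytes) -> Register:
    """Example helper: attach a correct odd-parity bit to each byte."""
    return [(b, 1 - bin(b).count("1") % 2) for b in value]


gprs = [with_parity(b"\x00\x00\x00\x2a")]
crs = [with_parity(b"\x00\x00\x00\x01")]
crs[0][3] = (crs[0][3][0], crs[0][3][1] ^ 1)  # inject a parity error into one CR byte
saved, valid = save_registers({"GPR": gprs, "CR": crs})
```

In this sketch a single bad byte marks the whole register type as invalid, which matches the one-valid-bit-per-type granularity described in the text.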

Even if the content of some of the failed processor's
registers is not valid (which prevents the
continuation of the terminated processor program),
this invention still allows the stored register
information which is valid to be used to identify some
or all of the system resources assigned to the
terminated program, to obtain a release of such
resources to the system and thereby increase the size
of the pools of resources available for use by
subsequent tasks, increasing the efficiency of
subsequent system operations.

This invention may require special support in the
system hardware, microcode and/or operating system
(e.g. MVS, VM or PR/SM), and may provide a unique
"checkstop log out bit" in the logout area of the
failing processor's program save area (PSA) in system
storage to signal the existence of a failed
processor's incomplete task needing a continuation of
execution on another processor.

The SP may also determine, because the hardware has
become degraded due to past errors resulting in
performance-degrading corrections (e.g. a failed
portion of its cache has been deconfigured), that a
CPU should be checkstopped before it reaches a solid
error condition due to further degradation beyond
tolerable limits. The SP then does a checkstop of the
CPU so that the parts that will correct its
degradation problem can be replaced. This is an
error-preventative operation (since it is done before
the processor has an error), resulting in preventative
maintenance, and it has a high probability of
preventing an error from occurring in the CPU. This
decision can be made during a task by using this
invention to complete the task on another CPU.

Description of the Drawings 

Fig. 1
is a flow diagram of the detection of a task
termination due to an error occurring on any
processor (CPUf) in a multiple processor (MP)
system. Most of the process shown in this figure
is prior art but it shows where an embodiment of
the invention deviates from and bypasses the
prior process.

Fig. 2
is a continuation of the prior art flow diagram
started in Fig. 1.

Fig. 3
is a flow diagram blow-up of the malfunction
alert (MFA) signalling step shown in Fig. 1.

Fig. 4
represents a multiprocessor system using the
invention.

Fig. 5
represents an example of register contents in
each CPU in an MP system available for use by
the invention.

Fig. 6
is a timing diagram of instruction processing
showing an example of when an error might
occur during processing of any instruction.

Fig. 7
shows a part of the prefix save area (PSA) in
system main storage (MS) in which critical
information is stored by a machine check
interruption of any CPU in the system using the
ESA/370 architecture in a preferred embodiment
of the invention.

Fig. 8A
represents the SIGP status block in the hardware
storage area (HSA) of CPUf, and Fig. 8B
represents the external interruption identifier
block in the HSA of CPUh.

Figs. 9 and 10
provide a flow diagram of processing steps used
by the preferred embodiment of the invention.

Detailed Description of the Background Process

The preferred embodiment of the invention is
represented by a process that begins in Fig. 1 and
continues in Figs. 9 through 11. Figs. 1 through 3
mostly represent background prior art useful in
understanding the invention. The invention is
concerned with a program executing on any CPU in a
multiple processor (MP) system when a hardware
condition occurs in any CPU and prevents the CPU from
continuing execution of its current program. The
hardware condition (referred to as a hardware error)
may be a failure in the hardware circuits, or may be
in the microcode of the CPU. The failure will likely
occur during the execution of some instruction in the
program, but it may also occur during an interruption
execution between instruction executions.

Herein, a CPU having an error is referred to as CPUf,
meaning CPU(failure). Any operational CPU in the
system which does not have any error is referred to as
CPUh, meaning CPU(healthy).

Large computer systems dispatch programs in work units
of program execution called tasks. Each task may
comprise one or more programs and data that execute
together. The preferred embodiment of the invention
was developed on an IBM ESA/370 MP system, which has
its architecture described in a publication having
form number SA22-7200 entitled "ESA/370 Principles of
Operation", which is incorporated herein by reference
into this specification and in which chapters 4, 5, 6
and 11 are particularly pertinent.

Fig. 4 shows an MP system in which the preferred
embodiments may be used. It contains a plurality of
CPUs 1 through N, and a service processor (SP).
However, the SP function may be performed in any of
CPUs 1-N to eliminate the need for a separate
processor. The preferred embodiment prefers a separate
processor, even though the invention comprehends not
having a service processor and doing the SP steps in
this invention using one of the CPUs.

The MP in Fig. 4 has a system hardware memory that
includes a hardware part 41, referred to as the system
main memory (MS), which contains all absolute
addresses usable by the operating system software (OS)
and all application programs (applications) that run
on the system. Another hardware part 42, referred to
as the microcode area (MA), is reserved for microcode
use by the CPUs and system. The MS contains the prefix
areas for the respective CPUs in the system, which are
accessed by the OS. The MA contains respective
hardware save areas for the CPUs, accessed by
microcode of the respective CPUs.

Fig. 5 represents the most important registers in each
CPU that need to be saved in the PSA of the respective
CPU in MS upon an interruption of the CPU's operation.
These are not the only registers needing to be saved
upon interruption, which are more precisely defined in
the ESA architecture book referenced above in its
chapter 11 sections entitled "Check-Stop State",
"Machine-Check Interruption", and
"Machine-Check-Interruption Code". Fig. 7 represents
part of the PSA for any CPU upon the occurrence of an
MC interruption, showing a blow-up of the MCIC field
in the PSA. The representation in Fig. 7 is exemplary,
and the complete MCIC representation is available in
the prior art above-referenced section entitled
"Machine-Check-Interruption Code".

The interruption signalling to the PSAs of the
processors is done by the external interruption
defined in a section entitled "External Interruption"
in the previously referenced ESA/370 Principles of
Operation chapter 6. Also, signal processor (SIGP)
instruction operations are described in chapter 4
entitled "CPU Signaling and Response" in this same
book.

Fig. 8A represents the SIGP status block in the HSA
for any CPU. A checkstop field in the HSA block for
any failed CPUf is set by the SP. A SIGP instruction
to a CPUf indicates when the respective CPU is in a
checkstop condition.

Fig. 8B represents the external interruption CPU ID
block in the HSA for any CPU. The CPU ID field in this
block receives the CPU ID of a failing CPUf set by the
SP.

The processing Figures show steps of a process, and
each step is given a reference number.

Step 1 in Fig. 1 represents the execution of any
instruction or interruption in any task by any CPU in
a multiple processor system. Step 2 is the detection
in any CPU of a hardware error condition, so that the
CPU becomes the CPUf in the MP system.

All hardware errors are tracked in the MP system by
the service processor (SP). Each time a hardware error
is detected in any CPU, it is reported by the CPU by
step 3 sending an error signal to the SP.

Fig. 6 is an example of when a hardware error might
occur during the execution of an instruction, in which
the error is shown to occur during the operand fetch
and execution period.

When step 3 reports an instruction error to the SP,
the SP classifies the error into one of three
categories: retryable error, non-retryable error, or
checkstop error. Most errors are retryable, and then
step 5 is entered and processed by CPUf. But there are
error conditions that are not retryable, for example
if an address error occurs in the prefix register of
the CPUf, so that its PSA cannot be found; this
prevents any interruption handling for such a CPUf,
and no retry is possible for it, so that step 17 is
immediately entered to checkstop that CPUf. If the
prefix address is valid, then even though the error
prevents the instruction from being retryable, an
interruption can be set and recognized for the CPUf.
Step 10 then is performed in which the SP checks if
the processor damage (PD) threshold has been exceeded,
and if so the SP does a checkstop for CPUf, but if the
threshold is not exceeded the SP merely sets the
processor damage (PD) bit on and sets off the backup
bit (B) in step 11.
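
The classification just described can be pictured with a small decision function. This is a hedged sketch only: the `Action` enum, the `classify` function and its three boolean inputs are hypothetical names introduced for illustration, and a real service processor would derive these conditions from hardware error signals rather than receive them as flags.

```python
from enum import Enum


class Action(Enum):
    RETRY = "step 5: CPUf retries the instruction"
    CHECKSTOP = "checkstop CPUf"
    SET_PD_BIT = "step 11: set PD bit on, backup (B) bit off"


def classify(retryable: bool, prefix_valid: bool, pd_threshold_exceeded: bool) -> Action:
    """Choose the SP's action for an instruction error reported in step 3."""
    if retryable:
        return Action.RETRY
    if not prefix_valid:
        # The PSA cannot be found, so no interruption can be handled for CPUf.
        return Action.CHECKSTOP
    # Step 10: not retryable, but an interruption can still be recognized.
    return Action.CHECKSTOP if pd_threshold_exceeded else Action.SET_PD_BIT
```

For example, `classify(False, True, False)` yields `Action.SET_PD_BIT`, the case in which the prefix address is valid and the PD threshold has not yet been exceeded.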

The error is determined to be solid or intermittent by
step 5 having CPUf retry the instruction. An
intermittent error is one that goes away during one of
the retry loops, and the next instruction is then
executed, etc., until the task is successfully
completed if no error is detected.

A solid error is determined if the error persists
through each retry of the instruction until the number
of retries exceeds some threshold number, called an
instruction retry threshold, which is tested by step
5. Thus, if the error persists when the threshold is
reached, the error is considered a solid error (it is
unlikely to be corrected by the passage of time).
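
The retry test of step 5 can be illustrated with a short loop. This is a minimal sketch under stated assumptions: `execute_once` is a hypothetical hook standing in for re-execution of the failing instruction, and the exact boundary (reaching versus exceeding the threshold) is an assumption, since the text does not fix it.

```python
from typing import Callable


def is_solid_error(execute_once: Callable[[], bool], retry_threshold: int) -> bool:
    """Return True if the error persists through every retry up to the threshold.

    execute_once() is a hypothetical hook that re-executes the failing
    instruction and returns True when it completes without error.
    """
    for _ in range(retry_threshold):
        if execute_once():
            return False  # the error went away: intermittent
    return True  # the error persisted through all retries: solid


# Example: a fault that clears on the third attempt, and one that never clears.
attempts = iter([False, False, True])
cleared_result = is_solid_error(lambda: next(attempts), retry_threshold=5)
stuck_result = is_solid_error(lambda: False, retry_threshold=5)
```

Spacing the retries over time is what gives an intermittent fault the chance to disappear; the sketch omits any delay between attempts for brevity.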

If a solid CPU hardware error occurs during a CPU
interruption operation, there is no incomplete
instruction to retry since interruption occurs between
instruction executions. And the process will not
branch back to step 1 to make any instruction retry
effort. Instead the hardware in the system will
attempt to recover the interruption by comparable
prior art techniques and similarly may declare a solid
error. But whether a solid error occurs during
execution of an instruction or during an interruption
operation, the current program on the failing CPUf is
terminated in the failed CPUf.

After a solid error condition is determined, the 
prior art process may be continued by entering 
step 6, or the novel process of the preferred em- 
bodiment of the invention may be performed by 
entering Fig. 9. However, this invention is better 
understood if the prior art process is first explained, 
so step 6 is assumed to be entered here. 

When the prior art process determines a solid error
exists, step 6 increments a processor damage (PD)
count and compares the count to a PD threshold. The PD
count is the number of solid errors detected over some
period of time, such as over an eight hour period. The
PD count is incremented by one each time a solid error
is determined, and the resulting number is compared to
a PD threshold value which is the maximum number of
solid errors allowed in a CPU over the chosen period
of time, such as eight hours. If step 6 determines the
threshold has not been exceeded, step 7 is executed.
If the threshold has been exceeded, step 12 is
entered.
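
The PD count test of step 6 might be modeled as a counter over a time window. The sketch below is illustrative only: the class name `ProcessorDamageCounter` is hypothetical, and the sliding-window policy is an assumption, since the text says only that errors are counted over some period such as eight hours.

```python
from collections import deque


class ProcessorDamageCounter:
    """Counts solid errors inside a sliding time window (assumed policy)."""

    def __init__(self, threshold: int, window_seconds: float = 8 * 3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._times = deque()

    def record_solid_error(self, now: float) -> bool:
        """Record one solid error; return True if the PD threshold is now exceeded."""
        self._times.append(now)
        # Forget errors that have fallen out of the observation window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.threshold


pd = ProcessorDamageCounter(threshold=2)
results = [pd.record_solid_error(t) for t in (0.0, 3600.0, 7200.0)]
late = pd.record_solid_error(20 * 3600.0)  # earlier errors have expired by now
```

A return value of True corresponds to entering step 12 (checkstop), and False to executing step 7.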

In step 12 the SP checkstops CPUf. In step 13 the SP
sends a malfunction alert (MFA) signal to the other
CPUs in the system, indicating to the other CPUs that
CPUf is failing. MFA signalling step 13 is shown in
detail in Fig. 3. In step 14, the MFA signalling by
the SP causes an external interruption for any of the
other CPUs, i.e. conventional PSW swapping is done in
the external interruption area in the PSA in MS for
the CPU. Any interruption-enabled CPU in the system
can take the conventional external interruption, which
uses the new PSW in any other CPU PSA to address an OS
routine that will ABEND (terminate) the current task
on CPUf.

Then, in step 15, the OS continues to run the system
but with the remaining healthy CPUs (without CPUf).

Optional step 16 recognizes that some programs have
designed within their code the ability to recover from
some types of error conditions, although most programs
do not have any, or have inadequate, recovery
capability. If the terminated program has a built-in
recovery capability, it may attempt to use it to
complete its execution.

This invention does not use any internal recovery
capability that may be built into any program, because
the invention can continue the execution of a program
independent of any internal recovery capability.

However, if step 6 finds the PD threshold has not been
exceeded, step 7 is executed in which the SP sets the
PD bit and the B bit within the MCIC (machine check
interrupt code) field in the PSA for CPUf. Then in
step 8, the SP signals CPUf to provide the MC (machine
check) interruption by storing its current CPUf PSW as
the old MC PSW and to access the new MC PSW, which
invokes step 21 in Fig. 2.

And step 9 requires assurance that all outstanding
stores in CPUf (stores completed by CPUf but not yet
put into MS) are made in MS. This can be done by CPUf
sending all of its outstanding stores onto a bus to MS
outside of CPUf which will not be affected when the
operation of CPUf is checkstopped. Step 9 can be done
in parallel with any one or more of SP steps 6, 7 and
8 even though it is shown at the end of the process in
Fig. 1.

In Fig. 2 step 21 uses the new PSW address to enter an
operating system (OS) recovery routine that attempts
to recover the terminated program on CPUf. Step 22
determines whether the hardware error occurred during
execution of an OS or application program. If the
error occurred during an OS program, step 23 is
entered to determine how pervasive the error damage
is, and if it is a type of error that may affect the
integrity of the system the "no" exit is taken to
crash the system (i.e. terminate system operation) to
allow manual intervention to correct the error. But if
the error only affects the operations on CPUf or is
correctable, then step 23 takes its "yes" exit to
continue the OS and terminated program's execution.

However, if step 22 finds the error is not in the OS
software, but is in the currently executing
application, then only the current application is
ABENDed, and the system continues operating with the
remaining CPUs (without CPUf). But the ABENDed task
may not be recovered in this prior art scenario.

Fig. 3 represents how MFA (malfunction alert)
signalling is done by the SP in the prior art, which
is represented by step 13 in Fig. 1 and is also used
in step 99 in Fig. 9. The MFA process is started by a
checkstop signal being issued by the SP for the
failing CPUf.

Step 31 is entered in Fig. 3, in which the SP writes
the MC checkstop code into the private HSA for CPUf.
The HSA is accessible only to microcode (and not to
the OS or to any application). This checkstop code
tells CPUf that it is disabled and cannot operate. In
step 32 the SP writes the CPUf identifier (ID) into
the private HSAs of every CPU in the system (except
for CPUf), which indicates the failure of CPUf to all
healthy CPUs in the system. Then in step 33, the SP
sends an MFA external interrupt signal to every CPU to
tell them to take the interruption. Step 34 represents
the first CPU to become enabled for external
interruptions as the one of plural CPUs (if there are
plural CPUs in the system) which will take the MFA
interruption, and thereafter the interruption is no
longer available to any other CPU which later becomes
enabled for interruption.
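
Steps 31 through 34 can be modeled in a few lines. The following is a simplified sketch, not the microcode mechanism itself: the `Cpu` and `MfaSignal` classes, their field names, and the `try_take` method are hypothetical, and the model treats the pending MFA interruption as a single token that only the first enabled CPU can claim.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cpu:
    cpu_id: int
    enabled_for_external: bool = False
    hsa_checkstop: bool = False               # checkstop code in this CPU's private HSA
    hsa_failed_cpu_id: Optional[int] = None   # ID of the failing CPU, written by the SP


class MfaSignal:
    """One pending MFA external interruption, taken by at most one CPU."""

    def __init__(self) -> None:
        self.taken_by: Optional[int] = None

    def try_take(self, cpu: Cpu) -> bool:
        # Step 34: only the first enabled, non-checkstopped CPU takes the interruption.
        if self.taken_by is None and cpu.enabled_for_external and not cpu.hsa_checkstop:
            self.taken_by = cpu.cpu_id
            return True
        return False


def signal_mfa(cpus: List[Cpu], failed: Cpu) -> MfaSignal:
    failed.hsa_checkstop = True                    # step 31
    for cpu in cpus:
        if cpu is not failed:
            cpu.hsa_failed_cpu_id = failed.cpu_id  # step 32
    return MfaSignal()                             # step 33: interruption made pending


cpus = [Cpu(0), Cpu(1), Cpu(2)]
mfa = signal_mfa(cpus, failed=cpus[0])
cpus[2].enabled_for_external = True   # CPU 2 becomes enabled first
first = mfa.try_take(cpus[2])
cpus[1].enabled_for_external = True   # CPU 1 becomes enabled later
second = mfa.try_take(cpus[1])
```

Here CPU 2 takes the interruption and CPU 1, enabled later, finds it no longer available, mirroring the behavior described for step 34.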

Preferred Embodiment 



The process in Fig. 9 is invoked when a solid error is
detected by step 5 in Fig. 1 after all retry efforts
fail to correct the problem. In Fig. 9 step 91 is
entered, in which CPUf sends a checkstop signal to the
SP, requesting that CPUf operation be stopped
including stopping the cycle clocks in CPUf.

Then in step 92, the SP signals CPUf to store the data
contents in its registers. This includes storing all
register data (e.g. the GPRs, FPRs, CRs, ARs, etc.)
needed for interruption in the logout area of the PSA
of CPUf. CPUf may or may not be successful in doing
this register store operation, depending on the type
of solid error existing in CPUf. It is preferable to
have CPUf do the storing, if it is able, rather than
the SP, since usually the SP is a slower processor
than CPUf. If the CPUf storing is successful, steps
93, 94 and 95 are skipped.

But if the CPUf storing is not successful, step 93 is
entered for the SP to perform the storing that CPUf
failed to do in step 92. Then the SP does the storing
of this register information in the PSA logout area of
CPUf, and step 94 is entered in which the SP completes
any outstanding stores of CPUf. The SP store
operations of step 94 may or may not be successful,
which is indicated by setting on or off an SLV (store
logical valid) flag bit in the MCICf. If a store error
occurs, the SLV bit is set to 0 in step 95B, and if no
store error occurs, the SLV bit is set to 1 in step
95A; and in either case, step 96 is entered.
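
The fallback from CPUf to the SP in steps 92 through 95 might be sketched as below. This is an assumption-laden illustration: the function `log_out_registers` and its three callback parameters are hypothetical, and leaving the SLV value as `None` when CPUf stores successfully is an assumption, since the text does not state what the bit holds in that case.

```python
from typing import Callable, Dict, Optional

RegisterData = Dict[str, bytes]


def log_out_registers(
    cpu_store: Callable[[], Optional[RegisterData]],
    sp_store: Callable[[], RegisterData],
    sp_complete_stores: Callable[[], bool],
) -> Dict[str, object]:
    """Steps 92 through 95: return the logout data and, if the SP acted, the SLV bit."""
    data = cpu_store()                 # step 92: CPUf tries first (it is usually faster)
    if data is not None:
        return {"logout": data, "stored_by": "CPUf", "slv": None}  # steps 93-95 skipped
    data = sp_store()                  # step 93: SP reads and stores the registers
    ok = sp_complete_stores()          # step 94: SP completes outstanding stores
    return {"logout": data, "stored_by": "SP", "slv": 1 if ok else 0}  # step 95A / 95B


result = log_out_registers(
    cpu_store=lambda: None,                          # CPUf could not store
    sp_store=lambda: {"GPR": b"\x00\x00\x00\x2a"},
    sp_complete_stores=lambda: False,                # a store error occurred
)
```

In this example the SP performs the logout and, because completing the outstanding stores failed, the SLV bit is reported as 0.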

In step 96 the SP sets the validity flag bits, the PD
(processor damage) bit and the CSLO (checkstop logout)
bit to indicate the checkstop condition of CPUf to the
OS, when it examines the PSA for CPUf. Each validity
flag bit is set on or off in step 96 for each type of
register found to have error-free, or erroneous,
contents, respectively, in step 93. Accordingly, the
set of valid bits in MCICf may indicate all types of
saved register contents are error-free, or the MCICf
set may indicate that less than all types of saved
register contents are error-free. This invention takes
different actions (represented near the end of Fig.
11) depending on whether the MCICf valid bits indicate
all register contents were saved error-free, or
whether only some types of register contents were
saved error-free.

In step 97, the SP saves the current CPUf PSW in the
MC old PSW in the CPUf PSA, and in step 98 the SP sets
CPUf into a checkstopped state in which CPUf has its
cycle clocks stopped so it can no longer function as a
normal CPU. In step 99, the SP signals a malfunction
alert (MFA) (shown in detail in Fig. 3) to all CPUs
except CPUf.

Then step 101 is entered to determine if an operable
CPU exists in the system on which the CPUf terminated
task can be dispatched. The invention requires a
system having plural CPUs, since this invention
requires a switching of the terminated task from CPUf
to a CPUh. However, some systems have restrictions on
how CPUs can be dispatched. In some MP systems, any
task can be dispatched on any CPU in the system, which
provides maximum CPU flexibility and sharing. But in
other MP systems, one or more of its CPUs may be
dedicated to a single type of job or to only one of
plural OSs. An example of an MP system allowing CPU
dedication to a particular OS is an ESA/370 multiple
CPU system using the IBM PR/SM hypervisor.

If no healthy CPU is available to continue the
interrupted task from the failing CPUf, the "no" exit
from step 101 is taken to step 102 to ABEND the CPUf
terminated task since no CPU resource is available to
continue its execution. But the "yes" path is taken
from step 101 to exit 10 to proceed with the process
of enabling the continuation of the execution of the
terminated CPUf task if a CPUh is available, and Fig.
10 is entered.
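
The test in step 101 can be sketched as a search for an operable CPU serving the same operating system as CPUf. This is an illustrative sketch only: `CpuInfo`, `find_usable_cpuh` and the `os_names` field are hypothetical names, and representing CPU dedication as a set of OS names is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class CpuInfo:
    cpu_id: int
    operable: bool
    os_names: FrozenSet[str]  # operating systems this CPU is dedicated to or shared by


def find_usable_cpuh(cpus: List[CpuInfo], failed_id: int, os_name: str) -> Optional[CpuInfo]:
    """Step 101: return an operable CPU serving the same OS as CPUf, or None."""
    for cpu in cpus:
        if cpu.cpu_id != failed_id and cpu.operable and os_name in cpu.os_names:
            return cpu
    return None  # "no" exit: step 102 ABENDs the CPUf terminated task


cpus = [
    CpuInfo(0, False, frozenset({"OS-A"})),          # CPUf, checkstopped
    CpuInfo(1, True, frozenset({"OS-B"})),           # dedicated to another OS
    CpuInfo(2, True, frozenset({"OS-A", "OS-B"})),   # shared by both
]
found = find_usable_cpuh(cpus, failed_id=0, os_name="OS-A")
missing = find_usable_cpuh(cpus[:2], failed_id=0, os_name="OS-A")
```

CPU 1 is skipped because it is dedicated to a different OS, which is the kind of dispatching restriction the text attributes to hypervisor-partitioned systems.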

Step 111 in Fig. 10 involves the selection of one
among the one or more operational CPU(s) to be
available for continuing the execution of the CPUf
terminated task. One or two CPUs may be involved in
the performance of step 111. The first operational CPU
available to take an outstanding external interruption
will take this MFA external interruption. Then the
interrupted CPU receives the CPUf ID previously
written into its HSA by step 32, senses the checkstop
field for CPUf in the SIGP status block to confirm the
checkstopped state of CPUf, and assigns one of the
operational CPUs as CPUh. CPUf is identified using a
SIGP instruction which has microcode that reads the
CPU ID field in block 82 in Fig. 8B.

Then in step 112, the OS routine on CPUh
reads the MCIC in the PSA for CPUf (i.e. MCICf). In
step 113 the OS routine tests the state of the
CSLO flag bit in MCICf. The CSLO bit is new in
this embodiment of the invention; and if the old
process explained with Figs. 1-3 is used, the CSLO
bit will not be set on, and the "no" path is taken to
step 120 which ABENDs the CPUf terminated task.

But the "yes" path to step 114 is the normal
path used by this embodiment because the CSLO
bit will be set on when CPUf failed. In step 114 the
OS tests the state of the validity bits in MCICf, and
if any validity bit indicates its register type is not
validly stored, the "no" path to step 117 is taken in
which OS ABENDs the CPUf terminated task. Then
optional step 118 in this novel embodiment enables
OS to link the PSAf saved contents of the CPUf
registers to its prematurely terminated application
program, for example, by OS putting the saved
data in a file and linking it to the CPUf terminated
program. Then the next time OS schedules that
program for dispatch in a new task, the program
will have the availability of the interruption data
acquired during the ABENDing of its last execution,
which data can be used to recover, correct, or
make more efficient the complete execution of the
program to obtain the required results.

And in step 119 after ABENDing the CPUf task,
this invention provides the saved contents of the
CPUf registers (and particularly the contents of the
control registers) to OS which analyzes the contents
to identify the system resources tied up by
the CPUf terminated task, and then OS releases
these resources so they can be reallocated to other
tasks. This OS release of resources makes more
resources available in the system for allocation to
future tasks to enable the overall system to operate
more efficiently.
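
The branching among steps 112 to 120 described above can be summarized by the sketch below. It is only a trace of the decision logic under simplifying assumptions: the function `os_routine_path`, the tuple `REGISTER_TYPES`, and the representation of the validity bits as a dictionary are hypothetical, and optional step 118 is always shown as taken.

```python
from typing import Dict, List

# Register classes named in the text; one validity bit is assumed per class.
REGISTER_TYPES = ("GPR", "FPR", "CR", "AR")


def os_routine_path(cslo: bool, validity: Dict[str, bool]) -> List[int]:
    """Return the Fig. 10 step numbers the OS routine on CPUh passes through."""
    steps = [112, 113]                 # read MCICf, then test the CSLO bit
    if not cslo:
        return steps + [120]           # old process: ABEND the CPUf task
    steps.append(114)                  # test the validity bits in MCICf
    if all(validity.get(r, False) for r in REGISTER_TYPES):
        return steps + [115, 116]      # continue the task on CPUh
    # Any invalid register type: ABEND, link saved data (optional step 118),
    # then release the resources identified from the saved registers.
    return steps + [117, 118, 119]


all_valid = {r: True for r in REGISTER_TYPES}
cr_invalid = dict(all_valid, CR=False)
```

Only the path ending in steps 115 and 116 avoids an ABEND; the other two paths end the terminated task abnormally.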

But most of the time, step 114 will find all of
the validity bits in MCICf are set on because all
interruption information has been validly stored,
and then step 115 is entered. In step 115, OS
dispatches the CPUf terminated task on CPUh, for
which OS fetches the saved contents of the CPUf
registers in the CPUf PSA and loads these contents
into the corresponding registers in CPUh, and
OS sets up the current PSW for CPUh by loading it
with the saved MC old PSW to enable the task
execution to continue starting with the address of
the instruction in the task following the last instruction
correctly executed, after which the task was
terminated by the CPUf failure.
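
A minimal sketch of the dispatch in step 115 follows, assuming a simplified data model. The classes `SavedState` and `HealthyCpu` and the function `dispatch_on_cpuh` are hypothetical names, the register sets are plain lists of integers, and the PSW is reduced to the address of the next instruction.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SavedState:
    """Assumed view of the logout area in the CPUf PSA."""
    registers: Dict[str, List[int]]   # e.g. {"GPR": [...], "CR": [...]}
    mc_old_psw: int                   # address of the next instruction to execute


@dataclass
class HealthyCpu:
    registers: Dict[str, List[int]] = field(default_factory=dict)
    current_psw: int = 0


def dispatch_on_cpuh(saved: SavedState, cpuh: HealthyCpu) -> None:
    """Step 115: load the saved CPUf state into CPUh."""
    # Load each saved register set into the corresponding registers of CPUh.
    for kind, values in saved.registers.items():
        cpuh.registers[kind] = list(values)   # copy, leaving the PSA untouched
    # Load the saved MC old PSW as the current PSW of CPUh, so execution
    # continues after the last correctly executed instruction.
    cpuh.current_psw = saved.mc_old_psw


saved = SavedState(registers={"GPR": [1, 2, 3], "CR": [7]}, mc_old_psw=0x2004)
cpuh = HealthyCpu()
dispatch_on_cpuh(saved, cpuh)
```

After the call, CPUh holds the register contents and PSW that CPUf had when it failed, which is what lets the task continue without re-executing completed instructions.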

Claims

1. A method of continuing the execution of a
program or program task which is terminated
before completion when its executing processor
(CPUf) fails due to a hardware condition,
comprising the steps of:

copying (92) contents of registers in the failing
processor (CPUf) into storage to store a predetermined
program continuation interruption
state when the processor detects a hardware
failure condition;

sending (33) a signal identifying the failing
processor (CPUf) to at least one other processor
(CPUh); and

interrupting (34) the operation of a processor
(CPUh) receiving the signal, and selecting
(111) a processor (CPUh) to continue execution
of the program or program task, and loading
(112) into the selected processor (CPUh)
from storage the stored program continuation
interruption state of the failed processor (CPUf)
to continue execution of the program or program
task from a last successfully executed
instruction without having any abnormal end
indicated for the program or program task.

2. A method of continuing the execution of a
program as defined in claim 1, further comprising:

setting (32) an indicator field in storage for
identifying the failing processor (CPUf) to an
operating system and for indicating the failing
processor (CPUf) is terminating operation for
the program or program task.

3. A method of continuing the execution of a
program as defined in claim 1, further comprising:

signalling (91) by the failing processor to a
service processor (SP) of the processor failure;
and

the copying (92) of the contents of the registers
in the failing processor (CPUf) being done
by the failing processor (CPUf) if the failing
processor is able, but then copying (93, 94)
being done by the service processor (SP) if
the failing processor (CPUf) is not able.

4. A method of continuing the execution of a
program as defined in claim 3, further comprising:

signalling (99) by the service processor (SP) of
a malfunction alert for the failing processor
(CPUf) to at least one operational processor
(CPUh) to request the operational processor
(CPUh) to continue processing the terminated
program.

5. A method of continuing the execution of a
program as defined in claim 3, further comprising:

the storing (92) of the register contents being
done in a logout area in system main storage
(MS) assigned for use by the failing processor
(CPUf).

6. A method of continuing the execution of a
program as defined in claim 3, further comprising:

setting by the service processor (SP) of flag
bits in a storage area accessible by a system
control program controlling the operation of the
failing processor (CPUf) and at least one other
operational processor (CPUh).

7. A method of continuing the execution of a
program as defined in claim 3, further comprising:

stopping the operation of the failing processor
(CPUf) to allow maintenance; and

setting by the service processor (SP) of a
processor-stopped indicator field in a hardware
storage area accessible by
microcode/processor hardware operation but
not by any operating system or application
program, and stopping the failed processor
(CPUf) operation.

8. A method of continuing the execution of a
program as defined in claim 7, further comprising:

assuring (94) that all outstanding stores done
by the failing processor (CPUf) are at least on
a bus to memory and external of the failing
processor (CPUf) before stopping the operation
of the failing processor (CPUf).

9. A method of continuing the execution of a
program as defined in claim 8, further comprising:

setting (96) by the service processor (SP) in
the storage area accessible by the system
control program [...] for
conditions required for continuing execution of
the terminated program.

10. A method of continuing the execution of a
program as defined in claim 9, further comprising:

setting by the service processor (SP) in the
storage area accessible by the system control
program of validity indicator bits for the copied
register contents.

11. A method of continuing the execution of a
program as defined in claim 10, further comprising:

setting (97) by the service processor (SP) in
the storage area accessible by the system
control program of a machine check old PSW
(program status word) for indicating the address
in the terminated program from the
failed processor (CPUf) where the continuation
of instruction processing is to occur.

12. A method of continuing the execution of a
program as defined in claim 4, the signalling of
a malfunction alert further comprising:

transmitting an external interruption signal to
all operational processors (CPUh).

14. A method of continuing the execution of a
program as defined in claim 12, further comprising:

sending (33) the malfunction alert signal to all
operational processors (CPUh) to request at
least one of the operational processors (CPUh)
to continue processing the terminated program.

13. A method of continuing the execution of a
program as defined in claim 12, the malfunction
alert signal further comprising:

storing (32) a malfunction alert signal with an
identifier of the failed processor (CPUf) in a
hardware storage area (HSA) accessible by
microcode/processor hardware operation but
not by any operating system or application
program assigned to each operational processor
(CPUh).

15. A method of continuing the execution of a
program as defined in claim 13, further comprising:

[...] (CPUf) [...] the MFA
indication to allow [...]
processor (CPUf); and

setting (32) by the service processor of an
identifier (ID) for the failed processor (CPUf) in
the hardware storage area (HSA), and a selected
operational processor (CPUh) sensing
the identifier (ID) of the failed processor (CPUf)
to determine its identification.

16. A method of continuing the execution of a
program as defined in claim 12, further comprising:

testing (101) the number of operational processors
(CPUh) in the system to determine if
there is at least one operational processor
(CPUh) in the system; and

abnormally (102) ending the terminated program
if there is no operational processor
(CPUh) in the system.

17. A method of continuing the execution of a 
program as defined in claim 12, further com- 
prising: 

selecting (34) an operational processor (CPUh) 
as the continuing processor for continuing the 
execution of the program terminated on the 
failed processor (CPUf) as the first operational
processor (CPUh) to become enabled for interruptions
after the malfunction alert signalling.

18. A method of continuing the execution of a 
program as defined in claim 17, selecting op- 
eration further comprising: 

interrupting an operational processor (CPUh) to 
select a processor to continue the execution of 
the terminated program. 

19. A method of continuing the execution of a 
program as defined in claim 18, selecting operation
further comprising:

selecting (111) a processor (CPUh) different 
from the interrupted operational processor for 
continuing the execution of the terminated pro- 
gram. 

20. A method of continuing the execution of a
program as defined in claim 18, selecting operation
further comprising:

selecting (111) the interrupted operational processor
as the processor (CPUh) for continuing
the execution of the terminated program.

21. A method of continuing the execution of a 
program as defined in claim 20, further com- 
prising: 

preparing for the continued execution of the 
terminated program by initiating an operating 
system routine on the selected processor 
(CPUh) by fetching an address in a machine 
check new PSW in the saved information for
the terminated program on the failed processor 
(CPUf). 

22. A method of continuing the execution of a
program as defined in claim 21, further comprising:

executing (112) the operating system routine
on the selected processor (CPUh) to read a
checkstop indicator field in the machine check
interruption code (MCICf) stored in an area in
main memory assigned to the failed processor
(CPUf) in order for the selected processor
(CPUh) to learn that the failed processor
(CPUf) has failed, and continuing (116) the
execution of the terminated program only if the
checkstop indicator field (CSLO) indicates a
checkstopped condition.

23. A method of continuing the execution of a
program as defined in claim 22, further comprising:

abnormally (113) ending the terminated program
if no checkstop condition is indicated in
the checkstop indicator field (CSLO) in the
machine check interruption code (MCIC).

24. A method of continuing the execution of a
program as defined in claim 22, the execution
of the operating system routine further comprising:

testing (114) the state of the validity bits in the
machine check interruption code (MCIC); and

abnormally (117) ending the terminated program
if any validity bit indicates an invalid
state in the machine check interruption code
(MCIC).

25. A method of continuing the execution of a
program as defined in claims 22 or 23, executing
the operating system routine and further
comprising the steps of:

accessing (118) the register contents for which
validity bits indicate [...];

using (119) the valid register contents to identify
resources allocated to the abnormally ended
program; and

releasing (119) the resources from the abnormally
ended program by action of an operating
system.



26. A method of continuing the execution of a
program as defined in claim 1 by controlling
the process of claim 1 with the steps of:

repeating (1) the execution of an instruction a
plurality of times as long as the instruction
execution is detected to have an error, and
detecting a solid error in the hardware of the
processor when a predetermined number of
repeated executions have occurred without the
instruction executing error-free; and then

initiating the process defined in claim 1.

27. A method of continuing the execution of a
program as defined in claim 1 by controlling
the process of claim 1 with the steps of:

detecting (2) an error in the hardware of the
processor when instruction execution has
occurred with an error condition; and then

initiating the process defined in claim 1 even
though the error condition of the processor
hardware is an intermittent type of error condition.

28. A method of continuing the execution of a
program as defined in claim 1 by controlling
the process of claim 1 with the steps of:

detecting degradation in the state of the hardware
of a processor caused by occasional removal
of failing hardware components;

determining (5) when the degradation state of
the processor has exceeded a predetermined
threshold level; and

initiating the process defined in claim 1 when
the predetermined threshold level (PD) has
been exceeded for continuing a current task on
another processor (CPUh) before taking the
processor [...] maintenance.

29. A method of continuing the execution of a
program as defined in claim 28, in which the
determining step comprises:

testing the performance rate of the processor
in relation to a predetermined processor performance
rate to find when the predetermined
threshold level (PD) has been exceeded.

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FIG. 1 



START

1. CPUf EXECUTES AN INSTRUCTION IN A PROGRAM TASK

2. CPUf DETECTS HARDWARE ERROR FOR THE INSTRUCTION

3. CPUf SENDS ERROR SIGNAL TO SP

4. CPUf ERROR TYPE BY SP? (NOT RETRYABLE) (RETRYABLE) (CHECKSTOP ERROR)

5. CPUf TESTS IF INSTRUCTION RETRY THRESHOLD EXCEEDED? (NO: RETRY ATTEMPT) (YES)

SOLID ERROR DETECTED (TO PRIOR PROCESS) (TO NOVEL PROCESS)

SP DETERMINES IF PD THRESHOLD EXCEEDED? (YES) (NO)

17. SP SETS MCICf AND PD BIT & BACKUP BIT

8. SP TELLS CPUf TO TAKE MC INTERRUPTION

9. CPUf ASSURES ALL OUTSTANDING STORES ARE MADE IN MS (TO FIG. 2)

10. SP CHECKS IF PD THRESHOLD EXCEEDED? (NO)

17. CHECKSTOP CPUf

11. SP SETS PD BIT ON AND SETS BACKUP (B) BIT OFF

12. SP CHECKSTOPS CPUf

13. SP SENDS MFA SIGNAL TO OTHER CPUs (SEE FIG. 3)

14. ANY ENABLED CPU TAKES EXTERNAL INTERRUPT AND OS ABENDS FAILED TASK

15. OS RUNS SYSTEM WITH REMAINING HEALTHY CPUs (WITHOUT CPUf)

16. ABENDED TASK MAY OR MAY NOT ATTEMPT TO RECOVER USING ANY RECOVERY BUILT INTO APPLICATION SOFTWARE
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FIG. 2 



PRIOR ART

FROM FIG. 1

21. CPUf EXECUTES OS RECOVERY PROGRAM AT NEW-MC-PSW LOCATION

22. ERROR FOUND IN OS? (NO: THEN ERROR IS IN APPLICATION PROGRAM TASK)

23. CAN OS RECOVER? (YES) (NO)

CRASH SYSTEM

ABEND APPLICATION PROGRAM TASK

(CONTINUE OS AND TASK EXECUTION)

EP 0 505 706 A1 



FIG. 3 

PRIOR ART 



(MALFUNCTION ALERT (MFA) SIGNALLING)

(SP STARTS MFA PROCESS WHEN A CHECKSTOP SIGNAL IS REQUIRED FOR CPUf)

31. SP WRITES MACHINE CHECK (MC) CHECKSTOP CODE INTO PRIVATE HSA AREA OF CPUf

32. SP WRITES CPUf IDENTIFIER (ID) INTO PRIVATE HSA AREA OF EVERY OTHER CPU IN SYSTEM

33. SP SIGNALS ALL CPUs EXCEPT CPUf TO TAKE AN MFA EXTERNAL INTERRUPTION

34. FIRST CPU TO BE INTERRUPT-ENABLED WILL TAKE THE MFA EXTERNAL INTERRUPT

END MFA PROCESS
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FIG. 4 
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FIG. 5 



FAILING STG ADDR
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CRs 
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CPU TIMER 
FIG. 8A



IN HSA OF CPUf:

SIGP STATUS BLOCK

CHECK STOP FIELD
(HAS CHECKSTOP STATUS OF CPUf)

FIG. 8B

IN HSA OF CPUh:

EXTERNAL INTERRUPT ID BLOCK

CPU ID FIELD
(HAS ID OF CPUf)
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FROM FIG. 

SOLID ERROR DETECTED

FIG. 9

91. CPUf SENDS CHECKSTOP SIGNAL TO SP

92. SP SIGNALS CPUf TO STORE ITS REGISTER DATA (GPRs, FPRs, CRs, ARs, ETC. IN CPUf) AND ALL CPUf OUTSTANDING STORES (NOT STORED) (STORED)

93. SP ACCESSES CPUf & STORES ITS MC INTERRUPTION DATA IN LOGOUT AREA IN PSA OF CPUf IN MS

94. SP COMPLETES ANY OUTSTANDING STORES OF CPUf (NO ERROR) (ERROR)

95A. SET SLV BIT=1

95B. SET SLV BIT=0

96. SP SETS VALIDITY BITS, PD BIT, CSLO BIT, ETC., IN MCICf IN CPUf PSA

97. SP SETS MC OLD PSW IN CPUf PSA

98. SP SETS CPUf IN CHECKSTOP STATE

99. SP SIGNALS MFA TO ALL CPUs EXCEPT FOR CPUf (SEE FIG. 3)

101. DOES A USABLE CPUh EXIST? (YES: TO FIG. 10) (NO)

102. ABEND CPUf TERMINATED TASK
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FROM FIG. 9 



FIG. 10

111. ANY CPU TAKES MFA EXT INTRPT, RECEIVES THE CPUf ID, VERIFIES CHECKSTOP STATE OF CPUf, & SELECTS A CPU AS CPUh.

112. OS ON CPUh READS MCICf IN MS

113. OS TESTS CSLO BIT SET ON IN MCIC? (YES) (NO)

120. OS ABENDS CPUf TERMINATED TASK

114. OS TESTS VALIDITY BITS SET ON IN MCICf? (VALID) (NO (INVALID))

117. OS ABENDS TASK OF CPUf

118. OS PROVIDES REGISTER CONTENT IN PSAf TO ABENDED TASK

119. OS RELEASES RESOURCES ALLOCATED TO CPUf TASK BY LOOKING AT SAVED CONTENTS OF CPUf REGISTERS TO INCREASE SYSTEM EFFICIENCY

115. OS CONTINUES INTERRUPTED CPUf TASK ON CPUh USING SAVED CONTENTS

116. CPUh CONTINUES CPUf TERMINATED TASK TO SUCCESSFUL COMPLETION (FAILURE OF CPUf IS NOT APPARENT TO THE CPUf TASK, WITHOUT ANY ABEND BEING SIGNALLED)
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